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Abstract 

Transportation distances have been used for more than a decade 
now in machine learning to compare histograms of features. They have 
one parameter: the ground metric, which can be any metric between 
the features themselves. As is the case for all parameterized distances, 
transportation distances can only prove useful in practice when this 
parameter is carefully chosen. To date, the only option available to 
practitioners to set the ground metric parameter was to rely on a pri- 
ori knowledge of the features, which limited considerably the scope of 
application of transportation distances. We propose to lift this limita- 
tion and consider instead algorithms that can learn the ground metric 
using only a training set of labeled histograms. We call this approach 
ground metric learning. We formulate the problem of learning the 
ground metric as the minimization of the difference of two polyhedral 
convex functions over a convex set of distance matrices. We follow the 
presentation of our algorithms with promising experimental results on 
binary classification tasks using GIST descriptors of images taken in 
the Caltech-256 set. 



1 Introduction 

We consider in this paper the problem of supervised metric learning on 
normalized histograms. Normalized histograms arise frequently in natu- 
ral language processing, computer vision, bioinformatics and more gener- 
ally areas involving complex datatypes. Objects of interest in such areas 
are usually simplified and each represented as a bag of smaller features. 
The occurrence frequencies of each of these features in the considered ob- 
ject can then be represented as a histogram. For instance, the repre senta- 
tion of images as histograms of pixel colors, SIFT or GIST features ( Lowel 



19991 . lOliva and Torralball200ll. iDouze et al.ll2009l'): texts as bags-of-wprds o r 
topic allocations ( Joachims 2002 . Blei et a.1. 20031 . Blei and Laffertv 20091 ): 



sequences as ra- grams counts (ILeslie et al.ll2002l ) and graphs as histograms 
of subgraphs (jKashima et al.l 120031 ) all follow this principle. 

Various distances have been proposed i n the statistics and rn achine learn- 
ing li teratures to compare two histograms ( Deza and Deza 20091 . §14), ( Rachev 
199ll ). Our focus is in this paper is on the f amily of trans porta tion distances 



which is both well motivated t heoreticallv (IVillani 



20031. S 71. (iRachev 



1991 



.5) a nd works well empirically ( Rubner et al. 1997 : 2000, Pele and Werman 
20091 ). Transportation distances are particularly popular in computer vision, 
where, after the influential work of Rubner et al. ( 19971 ). they were called 
Earth Mover's Distances (EMD). 

Transportation distances in machine learning can be thought of as meta- 
distances that build upon a metric on the features to form a distance on his- 
tograms of features. Such a metric, which is known in the computer vision 
literature as the ground metric^, i s the unique parame ter of transportation 



distances. In their seminal paper, iRubner et al 



(l200d l argue that, "in gi 



en- 



eral, the ground distance can be any distance and will be chosen according 
to the problem at hand". To our knowledge, the ground metric has always 
been considered a priori in all applications of EMD in machine learning. 
To be more precise, EMD has only been applied to datasets where such a 
metric was available and motivated by prior knowledge. We argue that this 
is problematic in two senses: first, this restriction limits the application of 
transportation distances to problems where such a knowledge exists. Second, 
even when such an a priori knowledge is available, we argue that there cannot 
be a "universal" ground metric that will be suitable for all learning problems 
involving histograms on such features. As with all parameters in machine 
learning algorithms, the ground metric should be selected adaptively. Our 
goal in this paper is to propose ground metric learning algorithms to do so. 

This paper is organized as follows: After providing some background on 
transportation distances in Section [21 we propose in Section [3] a criterion - 
a difference of convex function - to select a ground metric given a training 
set of histograms and a similarity measure between these histograms. We 
then show how to obtain local minima for that criterion using a subgradi- 
ent descent algorithm in Section [4l We propose different starting points to 
initialize this descent in Section [5j We provide a review of other relevant 



^ Since the terms metric and distance are interchangeable mathematically speaking, we 
will always use the term metric for a metric between features and the term distance for 
the resulting transportation distance between histograms, or more generally any other 
distance on histograms. 
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distances and metric learning tech niques in Section [6l in particular Maha- 
lanobis metric learning techniques (IXing et al.ll2003l . IWeinberger et al.ll2006l . 
Weinberger and Saul 20091 . Davis et al. 2007 ) which have inspired much of 



this work. We provide empirical evidence in Section [7] that the distances 
proposed in this paper compare favorably to competing techniques. We 
conclude this paper in Section [8] by providing a few research avenues. 



Notations 

We use upper case letters A,B,... for d x d matrices. Bold upper case 
letters A, B, . . . are used for larger matrices; lower case letters r, c, . . . are 
used for scalar numbers or vectors of W^. An upper case letter M and its 
bold lower case m stand for the same matrix written in d x d matrix form 
or vector form by stacking successively all its column vectors from the 
left-most on the top to the right-most at the bottom. The notations m 
and m stand respectively for the strict upper and lower triangular parts 
of M expressed as vectors of size (g). The order in which these elements 
are enumerated must be coherent in the sense that the upper triangular 
part of M'^ expressed as a vector must be equal to m. Finally we use the 
Frobenius dot-product for both matrix and vector representations, written 
as ='tr(^^5) = a^b. 



2 Optimal Transportation Between Histograms 



We recall in this section a few basic facts about mass transportation for two 
histog rams. A more general and technical introduction is provided bv lVillani 
(|2003l . Introduction &: §7); practical insight s and motivat i on for its appli- 



cation in machine learning can be found in iRubner et all (|2000l ): a recent 



rev iew of different extensio ns and particular cases of EMD can be found 
(IPele and Wermanll2009l . §2). 



m 



2.1 Transportation Polytopes 

For two scalar histograms r and c of sum 1 and dimension d, represented 
in the following as column vectors r = (ri, . . . , r^)'^ and c = (ci, . . . , c^)"^ of 
the canonical simplex S^-i = {« G M!^ | ||tt||i = 1} of dimension d — 1, the 
polytope C/(r, c) of transportation plans that map r to c is the set of d x d 
matrices with coefficients in R"*" such that their row and columns marginals 
are equal to r and c respectively, that is, writing 1^ € for the column 
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vector of ones, 



U{r,c) = {F £ 



ndxd 



Fir. 



c}. 



U{r, c) is a polytope of dimension d? — 2d+l in the general case where r and 
c have positive coordinates. U (r, c) is also known in the o perations research 



and s tatistical literatures as the set of transportation plans (iRachev and Riischendori 



1998 ) and contingency tables or two-way tables with fixed margins (jPiaconis and Efron 



19851 ). Given two histograms r and c, we define the following function of a 
d X d real matrix A: 

Grc{A)= min {A,X). (1) 

XeU{r,c) 

Equation ([1]) describes a linear program whose feasible set is defined by r 
and c and whose cost is parameterized by A. Grc is a positive homogeneous 
function, that is Grc{tA) = tGrc{A) for t > 0. When M is a matrix taken 
in the pointed, convex and polyhedral cone of metric matrices, 



M 



def 



M G 



r,dxd 



: VI < i,j,k < d,Mii = 0,Mi. 



Mji,Mij <Mik + Mj 



kj 



c 



the q uantity Grc{M) is known as the Kantorovich- Rubinstein distance (IVillani 
20031 . §7) between r and c. To highlight the fact that Grc{M) can also be 



seen as a the evaluation of a function of r and c parameterized by M, we 



will use the notation dM{r, c) Grc{M). 



adxd 



2.2 Transportation Distances 

The function djv/ : x S^-i — )• M parameterized by M has the fol- 

lowing properties: since M has a null diagonal dMir,r) is always zero; 
by nonnegativity of M, dM{r,c) > 0; by symmetry of M, dM is itself 
a symmetric function in its two arguments. More generally, djM is a dis- 
tance bet ween histogr ams whenever M is itself a metric, namely whenever 
M eM (IVillanill200.i Theo. 7.3). The distance dM bears many n ames and 
has many variations: 1 -Wasserstein, Monge-Kant orovich, Mallo w 's dMallowd 



1972, Levina and Bickel 2001 



sion 



„ Earth Mover's (iRubner et al.l l2000l'l in vr 

applications. Rubner et aP ( 200d ) and more recently Pele and Werman 



(|2009l ) have also proposed to extend the transportation distance to compare 
un-normalized histograms. Simply put, these extensions compute a distance 
between two unnormalized histograms u and v by combining any difference 
in the total mass of u and v with the optimal transportation plan that can 
carry the whole mass of u onto v if < ||t^||i or v onto u if ||w||i < 
We will not consider such extensions in this work; we believe however that 
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the approaches proposed later in this paper can be extended to handle EMD 
distances for unnormalized histograms. 

o?M can be computed as the solution of the following Linear Program 
(LP), 

dM (r, c) = minimize 

r 



T 
m X 



subject to Ax = 
X > 



(PO) 



where A is the (2d — 1) x d? matrix that encodes the row-sum and column- 
sum constraints for X to be in U{r, c) as 



llxd<^Id 
Id <^ llxd 



(8) is Kronecker's product and the lower subscript [-j^ in a matrix (resp. a 
vector) means that its last line (resp. element) has been removed. This 
modification is carried out to make sure that all constraints described by A 
are independent, or equivalently that A-^ is not rank def icient. This LP can 



be solved using the network simplex (IFord and Fulkersonlll962l ) or through 



more specialized network flow algorithms ( Ahuia et al. 19931 . ^9) 



2.3 Properties of Grc 

Because its feasible set U{r, c) is a bounded polytope and its objective is lin- 
ear, Problem ()POp has an optimal solution in the finite set Ex(r, c) of extreme 
points of U{r, c). Grc is thus the minimum of a finite set of linear functions 



and is by extension piecewise linear and concave (iBovd and Vandenberghe 
2004 . §3.2.3). Its gradient is equal to VGrr = x* whenever an optim al solu- 
tion X* to Problem ()POP is unique (jBertsimas and Tsitsikhd \199^ . §5.4). 
More generally and regardless of the uniqueness of x*, any optimal so- 
lution X* of Problem (IPOl) is in the sub-differential dGrdM) of Grc at 



M ( Bertsimas and Tsitsikhd 119971 . Lem.11.4). We use this property later 
in Section [H when we optimize the criteria considered in the section below. 



3 Criteria for Ground Metric Learning 

We define in this section a family of criteria to quantify the relevance of a 
ground metric for a given task, using a training dataset of histograms with 
additional information. 



5 



3.1 Training Set: Histograms and Side Information 

Suppose now that we are given a family (ri, • • • ,rn) of histograms in the 
canonical simplex with a corresponding similarity matrix [u!ij]i<i^j<n £ M"^" 
which quantifies how similar rj and rj arc: ujij is large and positive whenever 
Tj and rj describe similar objects and small and negative for dissimilar ob- 
jects. We assume that this similarity is symmetric, Uij = ojji. The similarity 
of an object with itself will not be considered in the following, so we simply 
set iOii = for 1 < i < n. 

Let us give more intuition on how these weights LOij may be set in prac- 
tice. In the most simple case, these weights may reflect a class taxonomy 
and be set to coij > whenever and rj come from the same class and 
ojij < for two different classes. This is the setting we consider in our 
experiments later in this paper. Such weights may be also inferred from 
a hierarchical taxonomy: each weight Uij corresponding to two histograms 
could for instance reflect how close the respective classes of these histograms 
lie in the tree of classes. 

Let us introduce more notations before moving on to the next section. 
Since Uij = uji and Gnrj = Grjr^ we restrict the set of pairs of indices 
to 

'^={{h3) \ hJ e {Ir-- ,n},i <j}. 
We also introduce two subsets of I: 

£+ ={{i, j)eX\ ujij > 0}; £. "^^{{i, j)&X\ ujij < 0}, 

the subsets of similar and dissimilar histograms. Finally, we write Gij for 
the GriTj functions. 

3.2 A Local Criterion to Select the Ground Metric 

We propose to formulate the ground metric learning problem as finding 
a metric M G such that the transportation distance (Im induced by 
this metric agrees with the weights lo. More precisely, this criterion will 
favor metrics for which, for a given pair of similar histograms rj and Vj 
(namely ujij > 0), the resulting distance Gij{M) is small. Conversely, for a 
given pair of dissimilar histograms r.j and rj (namely uJij < 0), the resulting 
distance Gij{M) should be large. The criterion should balance these two 
requirements, and in particular favor ground metrics for which these two 
ideas hold for pairs (i,j) such that \ujij\ is large. 
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Prom a formal perspective, any criterion to select M should consider the 
family of (2) pairs 

{(wii,Gii(M)), ••• ,K_i„,G„_i„(M))}. 

Since the ordering of the histograms should not influence the criterion, only 
symmetric functions (M x Srf_i)(2) ^ M of the couples of variables above 
should be considered. We propose in this paper a family of simple criteria: 
the average value oi uJijdMiri,rj), 

Coo(M) ='2 ^ coijGijiM), 
and a restriction of such an average to neighboring points, 

n 

CUM)''^'^S^,{M) + ST,{M), 
1=1 

where for each index i, the weighted sums of distances of its similar and 
dissimilar neighbors are considered respectively in 

^ifeW = E ^^jGviM), and Sr,{Mf^' ^^JG^j{M). (2) 

These sums are computed using the sets N^^ and N^, which stand for 
the indices of any k nearest neighbours of using distance dM, not nec- 
essarily unique, and whose indices are taken respectively in the subsets 

£i+'={j\{hj) or {j,i) e S+} and Si-'^={j\{i,j) or {j,i) e S^}. We adopt 
the convention that A'^^ = whenever k is larger than the cardinality 
of and follow the same convention for N^. This convention makes 
our notation Cqo consistent with the definition of Ck since one can indeed 
check that Ck = Coo for A; large enough, and notably for k > n. Since the 
techniques we propose below apply to both cases where k is finite or infi- 
nite, we only consider in the following an extended index € [1, 00] and its 
corresponding criterion Ck- 

3.3 Metrics of Interest 

Because Ck is positively homogeneous, the problem of minimizing Ck over 
the pointed cone M. is either unbounded or trivially solved for M = 0. To 
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get around this issue, we restrict our search to the intersection of and 
the unit sphere in M'^^'^ of a suitable matrix norm, that is 

Mi=MnBi, (3) 

where Bi = {A € R'^^'^ | ||^||. < 1} is defined by an arbitrary matrix norm 
\\A\\. such that the unit bah Bi is convex. Using criteria Ck, learning a 
ground metric from the family of points (ri, • • • , r„) and weights (wij)(jj)gx 
boils down to finding an optimal solution to problem 

min Cfc(M). (PI) 

Problem (|Pip has (2) variables, one for each upper-diagonal term in M. The 
feasible set A^i is closed, convex, bounded and Cj- is piecewise linear. As a 
consequence. Problem (jPip admits at least one optimal solution. 



4 C/c as a Difference of Convex Functions 

Since each funct ion Gjj is concave, Ck c an be cast as a Difference of Convex 
(DC) functions (|Horst and ThoailllQQfll ): 



Cfe(M)l2?5-(M)- -5+(M) 

where both {M) = YJi=i Sr^iM) and -S+(M) =f ^^^^ -S+(M) are con- 
vex, by virtue of the convexity of each of the terms and -Sfj. defined in 
Equation This follows from the concavity of each function Gij and the 
fact that such functions are weighted by negative factors, uoij for (i, j) G £~ 
and -ujij for (i, j) G £+■ We propose in this section an algorithm to obtain a 
local minimizer of Problem (jPl|) that takes advantage of this decomposition. 

4.1 Subdifferentiability of Ck 

The gradient of Ck computed at a given metric matrix M is 
VCfc(M) = V5^(M) - -V5+(M), 

where 

n n 



ik 
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whenever all solutions X*^ to the linear programs Gij considered in Ck are 
unique and whenever the set of k nearest neighbors of each histogram rj is 
unique. More generally, by virtue of the property that any optimal solution 
X*- is in the sub-differential dGij{M) of Gij at M we have that 

n n 

regardless of the unicity of the k nearest neighbors of each histogram rj. The 
details of the computation of (M) and one of its subgradients are given 
in Algorithm ([T]). The computations for S^{M) follow the same route; we 

use the abbreviation S^' ^ (M) to consider either of these two cases in our 
algorithm outline. 



Algorithm 1 Subgradient and Objective Computation for S\ ' (M) 

Input: M € Ail. optional: warm starts {Xij}. 

for e £{+-} do 

Compute the optimum z*j and an optimal solution X*j for Prob- 
lem (IPOp . using the network simplex for instance, with cost vector m 
and constraint vector [ri;rj]^:. optional: use warm starts {X^j}. 

end for 

Set G = 0, z = 0. 
for i G {1, • • • , n} do 

Compute the neighborhood set xj-^^' ^ by ranking all z*- in <Sj|_|_ 

for j G n}^'-^ do 
G^G + uj,jX*.. 

Z Z + LOijZ*. 

end for 
end for 

Output z and V = g + g; optional: return current solutions {X*j} as 
warm starts for the next iteration. 



4.2 Localized Linearization of the Concave Part of Ck 

We describe in Algorithm ([2|) a simple approach to minimize Gk locally based 
on a projected subgradient descent and a local linearization of the concave 
part of Ck- Algorithm ([2]) runs a subgradient descent on Gk using two nested 
loops. In the first loop parameterized with variable p, S'^ (the concave part 
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of Ck) and a point V-(- in its subdifferential are computed using the current 
metric Mp. Using this value and the subgradient V+, the concave part 5^ 
of Ck can be locally approximated by its first order Taylor expansion, 

Cfc(M) « Sj^iM) + S^{Mp) +Vl{M- Mp). 

This approximation is convex, larger than Ck and can be minimized in an 
inner loop using a projected gradient descent. When this convex function 
has been minimized up to a sufficient precision, we obtain a point 

Mp+i G argmin5-(M) + S^{Mp) + V^(M - Mp). 
MeMi 

We increment p repeat the step described above. The algorithm terminates 
when sufficient progress in the outer loop has been realized, at which point 
either the matrix computed in the last iteration, or that for which the ob- 
jective has been minimal so far, is returned as the output of the algorithm. 



Algorithm ([2]) fits the description of simplified DC algorithms (jTao and An 



19971 . §4.2) to minimize a difference of convex functions g — h in the case 
where either g or /i is a convex polyhedral function. In this paper both 
functions S'^{M) and -S^{M) are convex polyhedral; The overall quality of 
this local minima is directly linked to the quality of the initial point Mq. 
Choosing a good Mq is thus a crucial factor of our approach. We provide a 
few options to define Mq in Section [H 



5 Initial Points 

Algorithm ([2]) converges to a local minima of in A^i. We argue that this 
local solution can only provide a good approximation of the global minima 
if the initial point Mq itself is already a good initial guess, not too far from 
the global optimum of Ck ■ 



5.1 The li Distance as a Transportation Distance 

The li distance between histograms can provide an educated guess to define 
an initial point Mq to optimize Ck- Indeed, the li distance can be interpreted 
as the Kantorovich-Rubinstein distance^ seeded with the uniform ground 
metric Mt defined as Mt{i,j) = li=j. Because the h distance is itself a 
popular distance to compare histograms, we consider Mj in our experiments 

^Rigorously speaking, both distances are equal up to a factor 2, that is i||r — c||i = 
(r, c) 
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Algorithm 2 Projected Subgradient Descent to minimize Ck 
Input Mq € 7Wi (see Section [5]), gradient step to- 
t ^ 1. 

p ^ 0, M^"* ^ Mq. 
repeat 

Compute V+ and z+ of using Algorithm ([T]) with M°"*. 
repeat 

Compute V_ and of 5^ using Algorithm ([T]) with Af^" and warm- 
starts Xij, G £^ if defined; Set Ajj <— A*- for (i, j) G 
If g = 0, set ^ z^. 
Set 4^ ^ z_ + z+ + V^(m^'^ - m°"*) 

Set 1 ^ Pm, (rni" - ^(V+ + V_)) . 

q + 1. 
t^t + 1. 

until q < Qmax or insufficient progress for z^". 
p p + 1. 

until p < pmax or insufficient progress for 2;°"*. 
Output dp on Mp. 
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to initialize Algorithm ([2]). This starting point does not, however, exploit 
the information provided by the histograms {r^, 1 < i < n} and weights 
{u>ij, G I}- In order to do so, we approximate Ck by a linear function 
of M in Section 15.21 and show that a minimizer of this approximation can 
provide a better way of setting Mg, as shown later in the experimental 
section. 

5.2 Linear Approximations to Ck 

We propose to form an initial point Mq by replacing the optimization un- 
derlying the computation of each distance Gij{M) by a dot product, 

G,,(M)= min (M, X ) « (M, H,,- ) (4) 

X&U(ri,rj) 

where Hjj is a d x d matrix. We discuss several choices to define matrices 
Eij later in Section 15.31 We use these approximations to define the criteria 

Xoo(M) = 2 ^ uj^j{M,Eij) = 2{M, ^ ujijEij) (5) 
in the case where k = oo, or 

n 

n 

= E E "^ij-v + E '^v-ii > (6) 

when k < oo. Note that when k is finite, and only in that case, the k nearest 
neighbors of each histogram r, need to be selected first with a metric; we 
use Ml for this purpose. Although this trick may n ot be satisfactory, we 



observ e that similar approaches have been used by IWeinberger and Saul 



;o seed their algorithms with near neighbors in the initial phase of 



their optimization. Note that such a trick is not needed when k = 00 
which, in practice, seems to yield better results as explained later in the 
experimental section. In both cases where k = 00 and A; < 00, Xfc is a linear 
function of M which, for our purpose, needs to be minimized over A^i: 

min Xk(M) = min (M, ) (P2) 
M&Mi MeMi 
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where is a d x d matrix equal to the relevant sums in the right hand sides 
of either Equation (l5|) or ([6]) depending on the value of k. This problem 
has a linear objective and a convex feasible set. If the norm defining the 
unit ball Bi in Equation ([3]) is the li norm, that is the sum of the absolute 
values of all coefficients in a matrix, then Problem ()P2p is a linear program 
with 0{d^) constraints. For large d, this formulation might be intractable. 
We propose to consider instead the I2 norm unit ball for Bi which yields an 
alternative form for Problem (jPip , where the constraint M € Bi is replaced 
by a regularization term 



min A(M,Hfc ) + IIMII2 = min llAf + -Sfcllo, A>0 (P3) 
M&M ^ ' " A/ex 2 



Brickell et al. I (|2008l . Algorithm 3.1) have proposed recently a triangle fixing 



algorithm to solve problems of the form 

min IIM - i/lb, (P4) 

where i7 is a pseudo-distance, that is a symmetric, zero on the diagonal 
and nonnegative matrix. It is however straightforward to check that each 
of these three conditions, altho ugh intuitive when considering the metric 
nearness problem as d efined in ( Brickell et al. 20081 . §2), are not necessary 



for Algorithm (3.1) in ( Brickell et al.., 20081 . §3) to converge. This algorithm 



is not only valid for non-symmetric matrices H as pointed out by the authors 
themselves, but it is also applicable to matrices H with negative entries and 
non-zero diagonal entries. Problem (|P3|) can thus be solved by replacing H 



by — -fSfc in Problem (jP4]) regardless of the sign of the entries of H. 

We conclude this section by mentioning that other approaches can be 
considere d to minimize th e dot p roduct (M, H ) using alternative norms and 



methods. iFrangioni et al.l (|2005l ) propose for instance to handle linear pro- 



grams in the intersection between the cone of distances and the set of poly- 
hedral constraints {Mj^ + M^j + Mij < 2} which defines what is known as 
the metric polytope. These approaches are however more involved compu- 
tationally and we leave such extensions for future work. 



5.3 Representative Tables 

The techniques presented in Section 15.21 above build upon a linear approxi- 
mation of each function Gij (M) as {M, Hjj ) by selecting a particular matrix 
Eij such that Gij{M) w (M, Eij ). We propose in this section to obtain such 
an approximation by considering an arbitrary and representative transporta- 
tion table in U{r,c). More precisely, we propose to use a simple proxy for 
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the optimal transportation distance: the dot-product of M with a matrix 
that hes at the center of U (r, c) . 



5.3.1 Independence Table 



Many candidate tables ca n qualify as valid centers of gen eral polytopes as 
discussed for instance in (jBovd and Vandenberghd |2004| . §8.5). There is, 
however, a particular table in C7(r, c) which is easy to compute and which 
has been considered as a central p oint of ?7(r, c) in previous work: the 
independence table rc (lGoodlll963l ^. The table rc , which is trivially in 
U (r, c) because rc^lj_ = r and cr^l^ = c, is also the maximal entropy table 
in U (r, c) , that is the table which maximizes 



hiX) 



def 



' ^ Xj,qlOgXpg. 

p,q=l 



(7) 



Using the independence table to approximate Gij, that is using the approx- 
imation 

min {M,F)^rfMrj, 

FeU{ri,rj) 

yields the averages independence tables. 



if A: 



oo. 



if /c < oo. 



(8) 



Note however that this approximation tends to overestimate substantially 
the distance between two similar histograms. Indeed, it is easy to check 
that r^Mr is positive whenever M is a definite distance and r has positive 
entropy. In the case where all coordinates of r are equal to 1/d, r'^Mr is 
simply ||M||i/(i^. 



5.3.2 Typical Table 

Barvinok hold ) argues more recently that most transportation tables are 



close to the so-c alled typical table T of the transportation polytope and not, 
as was hinted by Good (jl963l ). to the independence t able. We briefly e xplain 
the concentration result obtained in ( Barvinok 2O10l . §1.5). Barvinok proves 
that, under the condition that r and c do not have too small coefficients, 
for any table X sampled uniformly on U{r, c) the difference between the 
sum of a subset of coefficients of X and the sum of the same coefficients in 
the typical table T is small with high-probability. Writing S C {1, • • • ,n}'^ 
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for a set of indices, and (Js{X) = J2p qes -^pq corresponding sum of 

coefficients, we have that for sets S big enough, 



P{X€ U{r, c), (1 - s)as{T) < as{X) < (1 + e)as{T))} > 1 - 2de 



-Kd 



where k and £ depends on the smoothness of r and c, that is the magnitude 
of their smalle st coefficients. T he typical table Tij of two histograms rj and 



is defined (jBarvinokl |2010| . §1.2) as the table in the polytope U{ri,rj) 



ndxd 



which maximizes the concave function g : 

d 

g{X) = J2 i^Pi + 1) HXpq + 1) - Xpg HXpq). 

p,q=l 



(9) 



Computing the t ypical tab l e dire ctly is not computationally tractable for 
large values of d. Barvinok ( 20091 . p. 350) provides however a different char- 



acterization of Tij as 



g Up Vq 



p,q<d 



where the vectors u and v in are the unique minimizers of the convex 
program 



T T 

mm r,- u + v 

u,v>0 •' 



(P5) 



p,q=l 

Both gradient and Hessian of the objective of Problem ()P5P have a simple 
form. The Hessian can be expressed in a block form where the diagonal 
blocks are themselves diagonal matrices and the off-diagonal blocks are of 
rank 1. u and v can thus be easily computed using second-order methods. 
The resulting matrices are thus 



if A; 



oo. 



E 



'jeN+iMt)^ij'^ij^ 



if k < oo. 



(10) 



We also note that, as for the independence table, {M, Tij ) will be signifi- 
cantly larger than when and rj are similar. 

Let us provide some intuition on the idea behind using the average typical 
table in Problem ()P3p . The solution to Problem (|P3p will be a metric M 
which will have high coefficients Mpq for any pair of features 1 < p,q < d 
such that is negative, namely a pair {p, q) such that the value of Xpg of a 
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transportation plan X sampled uniformly in each U {ri , rj ) is typically high 
on average for all mismatched histograms pairs. Mpq will be on the contrary 
small for a pair of features {p, q) such that the value of Xpg is typically high 
in transportations plans between similar histograms. 

To recapitulate the results of this section, we propose to approximate 
Ck by a linear function and compute its minimum in the intersection of the 
unit ball and the cone of matrices. This line ar objective can be efficiently 
minimized using a set of tools proposed by ( Brickell et al. 20081 ) adapted 



to our problem. In order to do so, the unit ball considered to define the 
feasible set Mi must be the unit ball of the Frobenius norm of matrices. 
In order to propose such an approximation, we have used the independence 
and typical tables as representative points of the polytopes U{r,c). The 
successive steps of the computations that yield an initial point Mq are spelled 
out in Algorithm ([3]). 

Algorithm 3 Initial Point Mq to minimize 
Set E = 0. 

for z G {1, • • • , n} do 
if /c = oo then 

else 

Compute the neighborhood sets and A''^'^ of histogram using 
an arbitrary distance, e.g. the li distance, 
end if 

forjGA+UA^do 

Compute a center Eij of U{ri,rj), e.g. either the typical or indepen- 
dence table. 

end for 
end for 



Set Mo ^ minA/gx||M+|H||2 using (|Brickell et al.ll2008l . Algorithm 3.1). 



Output Mq. optional: regularize Mq by setting Mq ^ AMo + (l — A)M; 



1- 



6 Related Work 

6.1 Metrics on the Probability Simplex 



Deza and Deza (|2009l . §14) provide an exhaustive list of metrics for probe 
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bility measures, most of which apply to probabihty measures on M and 
When narrowed down to distances for probabiUties on unor dered discrete 
the dominant case in machine learning applications - iRubner et al. 



sets 



(I2OOOI . §2) jropose to split the most commonly used distances into two fam- 
ilies: bin-to-bin distances and cross-bin distances. 

Bin-to-bin distances only compare the d couples of bin-counts (rj, Cj)i=i..d 
independently to form a distance between r and c: the Jensen-divergence, 
\2, Hellinger, total variatio n distances and more generally Csizar /-divergences 



(jAmari and Nagaokal l200ll . §3.2) all fall in this category. Notice that each 
of these distance is known to work usually better for histograms than a 
straightfo rward application of the Euclidean distance as illustrated for in- 
stance in (jChapelle et al.lll999l . Table 4) or in our experimental section. This 
can be explained in theo ry using geometric (Arnari and Naeaoka 200 ll . §3) 



or statistical arguments (jAitchison and Egozcu el l2005l ). 

Bin-to-bin distances are easy to compute and accurate enough to com- 
pare histograms when all d features are sufficiently distinct. When, on 
the contrary, some of these features are known to be similar, either be- 
cause of statistical co-occurrence {e.g. the words Nadal and Federer) or 
through any other form of prior knowledge (e.g. color or amino-acid simi- 
larity) then a simple bin-to-bin comparison may not be accurate enough as 
argued by ( Rubner et al. 200d . §2.2). In particular, bin-to-bin distances are 
large between histograms with distinct supports, regardless of the fact that 
these two supports may in fact describe very similar features. 

Cross-bin distances handle this issue by considering all possible 
pairs {ri,Cj) of cross-bin counts to form a distance. The most simple cross- 



coordinate distance for general vectors in 
family of distances. 



is arguably the Mahalanobis 



d''{x,y) = ^{x-yrn{x-y), 



where J7 is a positive semidefinite d x d matrix. The Mahalanobis distance 
between x and y can be interpreted as the Euclidean distance between Lx 
and Ly where L is a Cholesky factor of or any square root of fi. Learning 
such linear maps L or directly using labeled information has been the 
subject of a substantial amount of research in recent years. We briefly 
review this literature in the following section. 

6.2 Mahalanobis Metric Learning 

( 2OO3I ) ■ followed by Weinberger et al. ( 20061 ) and Davis et al. 



Xing et al 



(|2007l ) have proposed different algorithms to learn the parameters of a Maha- 
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lanobis distance, that is either a positive semi-definite matrix 17 or a hnear 
map L. These techniques define first a criterion C and a feasible set of 
candidate matrices to obtain, through optimization algorithms, a relevant 
matri x or L. The criteria we propose in Section [3] are modeled along these 
ideas. Weinberger et al. ( 20061 ) were the first to consider criteria that only 



use nearest neighbors, which inspired in this work the proposal of Ck for 
finite values of k in Section 13.21 as opposed t o considering the average over 



all possible distances as in (IXing et al.ll2003l ) for instance. 



We would like to insist at this point in the paper that Mahalanobis 
metric learning and ground metric learning have very little in common con- 
ceptually: Mahalanobis metric learning algorithms produce a d x d positive 
semidefinite matrix or a linear operator L. Ground metric learning pro- 
duces instead a metric matrix M. These sets of techniques operate on very 
different mathematical objects. 

It is also worth mentioning that although Mahalanobis distances have 
been designed for general vectors in M*^, and as a consequence can be applied 
to histograms, there is however, to our knowledge, no statistical theory which 
motivates their use on the probability simplex. 

6.3 Metric Learning in S^.i 

Lebanon (|2006l ) has proposed to learn a bin-to-bin distance in the probability 



simplex using a parametric family of distances parameterized by a histogram 
A S defined as 



d 



..(,..c)=™„s(E\/sai# 



, r'X V A 

1=1 



T his formula can b e simplified by using the perturbation operator proposed 



bv lAitchisonI (|l986l . p.46): 

def 1 
^A 



Vr, AgS^-i, r Q X = -^{riXi,- ■ ■ ,rdXd)^ 



AitchisonI argues that the perturbation operation can be naturally inter- 
preted as an addition operation in the simplex. Using this notation, the 
distance dx{r, c) becomes the simple Fisher metric applied to the perturbed 
histograms r A and c A, 

d\{r,c) = arccos(\/ r A, Vc A ). 

Using arguments related to the fact that the dist ance shou l d vary accordingly 
to the density of points described in a dataset, Lebanon ( 20061 ) proposes to 
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learn this perturbation A in a semi-supervised context, that is making only 
use of observed histograms but no other side-information. Because of this 
key distinction we do not consider this approach in the experimental section. 



7 Experiments 

We provide in this section a few details on the practical implementation of 
Algorithms ([T]) , ^ and ^ . We follow by presenting empirical evidence that 
ground metric learning improves upon other state-of-the-art metric learning 
techniques when considered on normalized histograms. 



7.1 Implementation Notes 

Algorithms ([T|), ([2]) and ([3]) were implemented using several optimization 
toolboxes. Algorithm ([1]) requires the computation of several transporta- 
tion problems. We use the CPLEX Matlah API implementation of network 
flows with warm starts to that effect. The computational gains we obtain by 
using the API, instead of using a function call to the CPLEX matlah toolbox 
or to the Mosek solver are approximately 4 fold. These benefits come from 
the fact that only the constraint vector in Problem ()POp needs to be updated 
at each iteration of the first loop of Algorithm ([T]) . We use the metricNear- 
ness toolbox^ to carry out both the projections of each inner loop iteration 
of Algorithm ([2]) , as well as the last minimization of Algorithm ([3]) . We com- 
pute Typical tables using the fminunc Matlab function, with the gradient 
and the Hessian of the objective of Problem (jPSp provided as auxiliary func- 
tions. Since fminunc is by definition an unconstrained solver, its solution 
is kept if both u* and v* satisfy positivity constraints, which is the 
case for a large majority of pairs of histograms. Whenever these constraints 
are violated we optimize again this problem by using a slower constrained 
Newton method. 



7.2 Images Classification Datasets 

We sample randomly 2N classes taken in the Caltech-256 family of images 
and consider 70 images in each class. Each image is represented as a normal- 
ized histogram of GIST features, obt ained using the LEAR implementation^ 
of GIST features dPouze et ahlbnOfll V These features describe 8 edge direc- 



tions at mid-resolution computed for each patch of a 4 x 4 grid on each 

^http : //people .kyb.tuebingen. mpg .de/ suvrit/work/progs/metricn.html 
http : / /lear . inrialpes . f r/ software 
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image. Each feature histogram is of dimension d = 8x4x4 = 128 and 
subsequently normahzed to sum to one. 

We spht these classes into two sets of classes, (ai, • • • , cat) and {qn+i, ■ ■ ■ 
and study the resulting A^^ binary classification problems that arise from 
each pair of classes in 

{ai, • • • , qn] X {oAT+i, • • • , a2Ar}. 

For each of these A^^ binary classification task we split the 70 + 70 points 
from both classes into 30 + 30 points to form a training set and 40 + 40 
points to form a test set. This amounts to having n = 60 training points 
following the notations introduced in Section 13.11 



7.3 Distances used in this benchmark 
7.3.1 Bin-to-bin distances 

We consider the li,l2 and Hellinger distances on GIST features vectors, 

li{r,c) = \\r - c\\i, l2{r,c) = \\r - c\\2, T-L{r,c) = \\^/r - ^/c\\2, 

where ^/r is the vector whose coordinates are the squared root of each co- 
ordinate of r. 



7.3.2 Mahalanobis distances 

We use the public implement ations of LMNN (jWeinberger and SaulllioogI ) 
and ITML (jPavis et al.l 120071 ) to learn two different Mahalanobis distances 
for each task. We run both algorithms with default settings, that is fc = 3 for 
LMNN and k = 4 for ITML. We use these algorithms on the Hellinger repre- 



sentations { 



, n} of all histograms originally in the training set. 



We have considered this representation because the Euclidean distance of 
two histo grams using the Hellinger m ap corresponds exactly to the Hellinger 
distance (lAmari and Nagaokall200ll . p. 57). Since the Mahalanobis distance 
builds upon the Euclidean distance, we argue that this representation is more 
adequate to learn Mahalanobis metrics in the probability simplex. The sig- 
nificant gain in performance observed in Figure [2] that is obtained through 
this simple transformation confirms this intuition. 



7.3.3 Ground Metric Learning 

We learn ground metrics using the following settings. In each classification 
task, and for two images and r^, 1 < i ^ j < 60, each weight uoij is set 
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to 1 if both histograms come from the same class and to —1 if they come 
from different classes. The weights cjjj are further normalized to ensure 
that j)e£'+ ^ij = 1 ^i^d Yl{ij)&e- ~ ^ consequence the total 

sum of weights ^^{i j)&s^ii equal to 0. The neighborhood parameter k 
is set to 3 to be directly comparable to the same parameter used for ITML 
and LMNN. The subgradient stepsize to of Algorithm ([2]) is set to = 0.1, 
guided by preliminary experiments and by the fact that, because of the 
normalization of the weights ojij introduced above, both the current iteration 
Mfc in Algorithm ([2]) and the gradient steps V+ or V_ all have comparable 
norms as matrices. We perform a minimum of 50 x 0.8^ gradient steps in each 
inner loop and set Pmax to 8. Each inner loop is terminated when the progress 
is too small, that is when the objective does not progress more than 0.75% 
every 6 steps, or when q reaches ^max = 200. We compute initial points Mq 
using different representative tables as described in Algorithm ([3|). Figure U] 
illustrates the variation in performance for different choices of Mq. There is 
no "natural" distance between GIST features that we could consider. We 
have tried a few, taking for instance distances based on the 8 directions 
and 4 X 4 = 16 possible locations in the grid described by the 128 GIST 
features, but we could not come up with one that was competitive with any 
of the methods considered above. This situation illustrates our claim in the 
introduction that GML can select agnostically a metric for features without 
using any prior knowledge on the features. 



7.4 Comparison with the SVM basehne 

For each distance d, we consider its exponentiat ed kernel exp(— d/cr) and 



use a Support Vector Machine (SVM) classifier (jCortes and Vapnild Il995l ) 
on the 80 test points estimated with the 60 training points. We use the 
following parameter grid to train the SVM's : the regularization parameter 
C is selected within the range {1, 10, 100}; the bandwidth parameter a is 
selected as a multiple of the median distance d computed in the training set, 
that is a is selected within the range {.1, .2, .5, 1, 2, 5} x med{d{ri,rj)}. We 
use a 4 folds (testing on left-out fold) and 2 repeats cross validation proce- 
dure on the training set to select the parameter pair that has lowest average 
error. Transportation distances are not negative definite in the general case, 
which is why we add a sufficient amount of diagonal regularization (minus 
the smallest negative eigenvalue) on the resulting 60 x 60 Gram matrices to 
ensure that they are positive definite in the training phase. 
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Distance Accuracy in Test Set K-Nearest Neighbor Classification Error on Test Foid 




Number of retrieved Neigfibours k parameter in k-NN 



Figure 1: (left) Accuracy of each considered distance on the test set as 
measured by the average proportion, for each datapoint in the test set, of 
points coming from the same class within its n nearest neighbors. These 
proportions were averaged over 25 x 25 = 625 classification problems using 
40 + 40 test points in each experiment. The ground metric in GML and 
Mahalanobis matrices in ITML and LMNN have been learned separately 
using a train set of 30 + 30 points, (right) k-NN classification error using 
the considered metrics and the 30 + 30 points in the training fold, both to 
compute the metrics when needed and to compare test points. These results 
are also averaged for 40+40 test points in each of the 625 classification tasks. 
By Mq = TypoQ we mean that the initial point was computed using typical 
tables and /c = oo in Algorithm ([3]). 



22 



7.5 Results 



The most important results of this experimental section are summarized in 
Figure [H which displays , for all considered distances, their average recall 
accuracy on the test set and the average classification error using a K-nearest 
neighbor classifier. These quantities are averaged over 

Ar2 = = 625 bi- 
nary classifications. In this figure, GML used with EMD is shown to provide, 
on average, the best possible performance: the left hand figure considers test 
points and shows that, for each point considered on its own, GML-EMD se- 
lects on average more same class points as closest neighbors than any other 
distance. The performance gap between GML-EMD and competing dis- 
tances increases significantly as the number of retrieved neighbors is itself 
increased. The right hand figure displays the average error over all 625 tasks 
of a K-nearest neighbor classification algorithm when considered with all dis- 
tances for varying values of k. In this case too, GML combined with EMD 
fares much better than competing distances. The average error when using 
a SVM with these distances is represented in the legend of the right-hand 
side figure. Our results agree with the general observation in metric learning 
that Support Vector Machines perfor m usually better than K-nea rest neigh- 
bor classifiers with learned metrics ( Weinberger and Saull liooil . Table 1). 
Note however that the K-nearest-neighbor classifier seeded with GML-EMD 
has an average performance that is directly comparable with that of support 
vector machines seeded with the li or I2 kernels. 

It is also worth mentioning as a side remark that the I2 distance does 
not perform as well as the h or Hellinger distances on these datasets, which 
validates our earlier statement that the Euclidean geometry is usually a poor 
choice to compare histograms directly. This intuition is further validated 
in Figure [H where Mahalanobis learning algorithms are show to perform 
significantly better when they use the Hellinger representation of histograms. 

Figure [3] shows that the performance of GML can vary significantly de- 
pending on the neighborhood parameter k. We have also considered as a 
ground metric the initial point Mq = Typ^^ obtained by using typical tables 
and k = 00 with Algorithm [3l The corresponding results appear as EMD 
Typoo curves in the figure. These curves are far below those correspond- 
ing to GML-EMD A; = 3 (Mq =Typoo). This gap illustrates the fact that the 
subgradient descent performed by Algorithm [2] does have a real impact on 
performance, and that choosing a good initial point in itself is not enough. 

Finally, Figure H] reports additional performance curves for different ini- 
tial points Mq. T hese exper iments tend to show that, despite their compu- 
tational overhead, iBarvinokl 's typical tables seem to provide a better initial 
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point than independence tables in terms of average performance. Please 
note that = 25 in Figures [1] and [21 and = 10 in Figures [3] and HI 



8 Conclusion and Future Work 
8.1 Overview 

We have proposed in this paper an approach to tune adaptively the unique 
parameter of transportation distances, the ground metric, given a training 
dataset of histograms. This approach can be applied on any type of fea- 
tures, as long as a set of histograms along with side-information, typically 
labels, are provided for the algorithm to learn a good candidate for the 
ground metric. The algorithms performs a projected subgradient descent 
on a difference of convex functions, and can only find local minimizers. We 
propose a few initial points to compensate for this arbitrariness, and show 
that our approaches provide, when compared to other competing distances, 
a superior average performance for a large set of image binary classification 
tasks using GIST features histograms. 



8.2 Ground Metric and Computational Speed 



We have argued in the introduction of this paper that the ground met- 
ric was never considered so far as a parameter that could be learned from 
data. The ground me tric has however a.ttrac te d a lot of attention recentl y 
for a different reason. iLing and Okadal (120071 ) , iGudmundsson et al.l (j2007l ) , 
Pele and WermanI ([20091) , [Ba et al.l (120 ll[ ) have all recently argued that the 
computation of the EMD can be dramatically sped up when the ground met - 
ric matrix has a certain structure. For instance, Pele and Werman ( 20091 ) 
have shown that the computation of each earth mover's distance can signif- 
icantly reduced whenever the larger values of any arbitrary ground metric 
are thresholded to a certain level. Ground metrics that follow such con- 
straints are attractive because they result in transportation problems which 
are provably faster to compute. Our work in this paper suggests on the 
other hand that the content (and not the structure) of the ground metric 
can be learned to improve classification accuracy. We believe that the com- 
bination of these two viewpoints could result in transportation distances 
that are both adapted to the task at hand and fast to compute. A strategy 
to achieve both goals would be to enforce such structural constraints on 
candidate metrics M when looking for minimizers of criteria Ck- 
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Figure 2: The experimental setting in this figure is identical to that of 
Figure [H except that only two different versions of LMNN and ITML are 
compared with the Hellinger distance. This figure supports our claim in 
Section 17.3.21 that Mahalanobis learning methods work better using the 
Hellinger representation of histograms, {y^, i = ,n}, rather than 

their straightforward representation in the simplex {ri}i=i^... ^n- 
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Figure 3: Distance accuracy and K-nearest neighbor classifier error for GML- 
EMD using different values for the neighborhood parameter k. The initial- 
ization setting Mq = Typ^^ is explained in the caption of Figure [H The 
last curve, EMD Typoo displays the results corresponding to the EMD with 
a ground metric set directly to the output of Algorithm ([3]) using typical 
tables and k = oo. The difference in performance between these curves and 
that of GML-EMD A; = 3, (Mq = Typ^^) illustrates the progress achieved by 
Algorithm [2j The performance curves also agree with the intuition that the 
GML metric set at a given neighborhood parameter k has a comparative ad- 
vantage over other GML metrics when the k parameter of nearest neighbor 
classifiers is itself close to k. The experimental setting is identical to that 
of Figure [H except that experiments were averaged here over 10x10 = 100 
classification problems, instead of 625. There is no overlap between these 
two sets of binary classifications. 
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Figure 4: Distance accuracy and K-nearest neighbor performance for GML- 
EMD set at k = 3 using different initial points described in Section [5l and 
particularly Section [5.3.2[ This figure shows that, despite its computational 
overhead, initializing Algorithm [2] using typical tables performs better on av- 
erage than using independence tables. The experimental setting is identical 
to that of Figure [3l 



27 



The mot computationally intensive part of Algorithm ([2]) lies in the re- 
peated calls to Algorithm ([T|) which itself computes a number 0{'n?) optimal 
transportations between pairs of histograms, where n is the size of the train- 
ing set. We have used the network simplex algorithm with warm starts to 
carry out these computations , but faster alterna t ives a long the lines of the 
implementations provided by Pele and Werman ( 20091 ) could provide com- 
putational improvements. The EMD is also known to accept lower bounds 
in particular cases. An efficient lower bound would reduce considerably the 
complexity of Algorithm (JT]) by narrowing down the computation of Optimal 
transportations to a smaller subset of neighbor candidates. 



8.3 Future Applications 

We have proposed in this paper a set of techniques to learn ground met- 
rics using histograms of arbitrary features. We believe these techniques 
will prove useful to st udy histograms of latent features, such as topics 
Blei and Laffertv ( 2009) or Dictionaries ( Kreutz-Delgado et ahllioosl . Mairal et al, 
20091 . Ijenatton et al.ll201ol ). for which natural metrics are not always avail- 
able. 
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